Re: Zedstore - compressed in-core columnar storage - Mailing list pgsql-hackers

From DEV_OPS
Subject Re: Zedstore - compressed in-core columnar storage
Date
Msg-id 99fefcc5-8320-a5d1-5d6d-cbe339a2f151@ww-it.cn>+C9B38613AA4CDF77
Whole thread Raw
In response to Re: Zedstore - compressed in-core columnar storage  (Ashwin Agrawal <aagrawal@pivotal.io>)
List pgsql-hackers
it's really cool and very good progress, 

I'm interesting if SIDM/JIT will be supported

best wishes

TY


On 2019/5/23 08:07, Ashwin Agrawal wrote:
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build required
> functionality, get passing the test and then continue to iterate to
> optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for
> Zedstore. All table AM API's are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
>   compression and performance. Based on Konstantin's performance
>   experiments [2] and inputs from Tomas Vodra [3], this is
>   added. Array item holds multiple datums, with consecutive TIDs and
>   the same visibility information. An array item saves space compared
>   to multiple single items, by leaving out repetitive UNDO and TID
>   fields. An array item cannot mix NULLs and non-NULLs. So, those
>   experiments should result in improved performance now. Inserting
>   data via COPY creates array items currently. Code for insert has not
>   been modified from last time. Making singleton inserts or insert
>   into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
>   rest of the data column btrees. This special or first btree for
>   meta-column is used to assign TIDs for tuples, track the UNDO
>   location which provides visibility information. Also, this special
>   btree, which always exists, helps to support zero-column tables
>   (which can be a result of ADD COLUMN DROP COLUMN actions as
>   well). Plus, having meta-data stored separately from data, helps to
>   get better compression ratios. And also helps to further simplify
>   the overall design/implementation as for deletes just need to edit
>   the meta-column and avoid touching the actual data btrees. Index
>   scans can just perform visibility checks based on this meta-column
>   and fetch required datums only for visible tuples. For tuple locks
>   also just need to access this meta-column only. Previously, every
>   column btree used to carry the same undo pointer. Thus visibility
>   check could be potentially performed, with the past layout, using
>   any column. But considering overall simplification new layout
>   provides it's fine to give up on that aspect. Having dedicated
>   meta-column highly simplified handling for add columns with default
>   and null values, as this column deterministically provides all the
>   TIDs present in the table, which can't be said for any other data
>   columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
>   pages in the relation. The FPM is also a b-tree, indexed by physical
>   block number. To be more compact, it stores "extents", i.e. block
>   ranges, rather than just blocks, when possible. An interesting paper [4]
> on
>   how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
>   - 31 out of 194 failing regress tests
>   - 10 out of 86 failing isolation tests
> Many of the current failing tests are due to plan differences, like
> Index scans selected for zedstore over IndexOnly scans, as zedstore
> doesn't yet have visibility map. I am yet to give a thought on
> index-only scans. Or plan diffs due to table size differences between
> heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL
>   logs
> - Performance profiling and optimizations for Insert, Selects, Index
>   Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead
>   of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently,
>   we walk plan and find the columns required for the query in
>   the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in
>   initial email, currently we have modified table am API to pass the
>   projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for
>   zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of
> the code continues to be contributed by Heikki. We are continuing to
> have fun building column store implementation and iterate
> aggressively.
>
> References:
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2]
> https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> 3]
> https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> 4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
>





pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: coverage increase for worker_spi
Next
From: Tomas Vondra
Date:
Subject: Re: error messages in extended statistics